import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
airdf = pd.read_csv('annual_conc_by_monitor_2020.csv')
airdf
| | State Code | County Code | Site Num | Parameter Code | POC | Latitude | Longitude | Datum | Parameter Name | Sample Duration | ... | 75th Percentile | 50th Percentile | 10th Percentile | Local Site Name | Address | State Name | County Name | City Name | CBSA Name | Date of Last Change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 10 | 44201 | 1 | 30.497478 | -87.880258 | NAD83 | Ozone | 1 HOUR | ... | 0.049 | 0.042 | 0.027 | FAIRHOPE, Alabama | FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE... | Alabama | Baldwin | Fairhope | Daphne-Fairhope-Foley, AL | 22-10-20 |
| 1 | 1 | 3 | 10 | 44201 | 1 | 30.497478 | -87.880258 | NAD83 | Ozone | 8-HR RUN AVG BEGIN HOUR | ... | 0.046 | 0.039 | 0.025 | FAIRHOPE, Alabama | FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE... | Alabama | Baldwin | Fairhope | Daphne-Fairhope-Foley, AL | 22-10-20 |
| 2 | 1 | 3 | 10 | 44201 | 1 | 30.497478 | -87.880258 | NAD83 | Ozone | 8-HR RUN AVG BEGIN HOUR | ... | 0.046 | 0.039 | 0.025 | FAIRHOPE, Alabama | FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE... | Alabama | Baldwin | Fairhope | Daphne-Fairhope-Foley, AL | 22-10-20 |
| 3 | 1 | 3 | 10 | 44201 | 1 | 30.497478 | -87.880258 | NAD83 | Ozone | 8-HR RUN AVG BEGIN HOUR | ... | 0.046 | 0.039 | 0.025 | FAIRHOPE, Alabama | FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE... | Alabama | Baldwin | Fairhope | Daphne-Fairhope-Foley, AL | 22-10-20 |
| 4 | 1 | 3 | 10 | 88101 | 1 | 30.497478 | -87.880258 | NAD83 | PM2.5 - Local Conditions | 24 HOUR | ... | 9.200 | 6.600 | 3.600 | FAIRHOPE, Alabama | FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE... | Alabama | Baldwin | Fairhope | Daphne-Fairhope-Foley, AL | 22-10-20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56409 | 78 | 20 | 1 | 88502 | 1 | 18.334399 | -64.795972 | WGS84 | Acceptable PM2.5 AQI & Speciation Mass | 24 HOUR | ... | 5.900 | 3.700 | 1.900 | CRUZ BAY, ST JOHN U.S. VIRGIN ISLANDS | ST JOHN, VIRGIN ISLANDS | Virgin Islands | St John | Not in a city | NaN | 20-10-20 |
| 56410 | 80 | 26 | 8012 | 44201 | 1 | 32.466389 | -114.768611 | WGS84 | Ozone | 1 HOUR | ... | 0.055 | 0.048 | 0.039 | San Luis Rio Colorado Well 10 | Calle 15 & Venustiano Carranza Water Well #10 | Country Of Mexico | SONORA | NaN | NaN | 14-10-20 |
| 56411 | 80 | 26 | 8012 | 44201 | 1 | 32.466389 | -114.768611 | WGS84 | Ozone | 8-HR RUN AVG BEGIN HOUR | ... | 0.050 | 0.044 | 0.033 | San Luis Rio Colorado Well 10 | Calle 15 & Venustiano Carranza Water Well #10 | Country Of Mexico | SONORA | NaN | NaN | 14-10-20 |
| 56412 | 80 | 26 | 8012 | 44201 | 1 | 32.466389 | -114.768611 | WGS84 | Ozone | 8-HR RUN AVG BEGIN HOUR | ... | 0.050 | 0.044 | 0.033 | San Luis Rio Colorado Well 10 | Calle 15 & Venustiano Carranza Water Well #10 | Country Of Mexico | SONORA | NaN | NaN | 14-10-20 |
| 56413 | 80 | 26 | 8012 | 44201 | 1 | 32.466389 | -114.768611 | WGS84 | Ozone | 8-HR RUN AVG BEGIN HOUR | ... | 0.050 | 0.044 | 0.033 | San Luis Rio Colorado Well 10 | Calle 15 & Venustiano Carranza Water Well #10 | Country Of Mexico | SONORA | NaN | NaN | 14-10-20 |
56414 rows × 55 columns
airdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56414 entries, 0 to 56413
Data columns (total 55 columns):
State Code                       56414 non-null int64
County Code                      56414 non-null int64
Site Num                         56414 non-null int64
Parameter Code                   56414 non-null int64
POC                              56414 non-null int64
Latitude                         56414 non-null float64
Longitude                        56414 non-null float64
Datum                            56414 non-null object
Parameter Name                   56414 non-null object
Sample Duration                  56414 non-null object
Pollutant Standard               15323 non-null object
Metric Used                      56414 non-null object
Method Name                      47655 non-null object
Year                             56414 non-null int64
Units of Measure                 56414 non-null object
Event Type                       56414 non-null object
Observation Count                56414 non-null int64
Observation Percent              56414 non-null int64
Completeness Indicator           56414 non-null object
Valid Day Count                  56414 non-null int64
Required Day Count               56414 non-null int64
Exceptional Data Count           56414 non-null int64
Null Data Count                  56414 non-null int64
Primary Exceedance Count         10953 non-null float64
Secondary Exceedance Count       9984 non-null float64
Certification Indicator          56414 non-null object
Num Obs Below MDL                56414 non-null int64
Arithmetic Mean                  56414 non-null float64
Arithmetic Standard Dev          56414 non-null float64
1st Max Value                    56414 non-null float64
1st Max DateTime                 56414 non-null object
2nd Max Value                    56204 non-null float64
2nd Max DateTime                 56204 non-null object
3rd Max Value                    56009 non-null float64
3rd Max DateTime                 56009 non-null object
4th Max Value                    55798 non-null float64
4th Max DateTime                 55798 non-null object
1st Max Non Overlapping Value    265 non-null float64
1st NO Max DateTime              265 non-null object
2nd Max Non Overlapping Value    265 non-null float64
2nd NO Max DateTime              265 non-null object
99th Percentile                  56414 non-null float64
98th Percentile                  56414 non-null float64
95th Percentile                  56414 non-null float64
90th Percentile                  56414 non-null float64
75th Percentile                  56414 non-null float64
50th Percentile                  56414 non-null float64
10th Percentile                  56414 non-null float64
Local Site Name                  51797 non-null object
Address                          56414 non-null object
State Name                       56414 non-null object
County Name                      56414 non-null object
City Name                        43626 non-null object
CBSA Name                        50405 non-null object
Date of Last Change              56414 non-null object
dtypes: float64(19), int64(13), object(23)
memory usage: 23.7+ MB
airdf.describe()
| | State Code | County Code | Site Num | Parameter Code | POC | Latitude | Longitude | Year | Observation Count | Observation Percent | ... | 4th Max Value | 1st Max Non Overlapping Value | 2nd Max Non Overlapping Value | 99th Percentile | 98th Percentile | 95th Percentile | 90th Percentile | 75th Percentile | 50th Percentile | 10th Percentile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.0 | 56414.000000 | 56414.000000 | ... | 55798.000000 | 265.000000 | 265.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 | 56414.000000 |
| mean | 28.309622 | 72.657691 | 1242.213387 | 67548.772415 | 2.979119 | 38.886443 | -96.184675 | 2020.0 | 1619.453398 | 43.150619 | ... | 63.484882 | 1.021132 | 0.912075 | 52.840958 | 49.931155 | 45.746594 | 42.763085 | 38.585976 | 34.818407 | 29.350281 |
| std | 16.847989 | 83.863108 | 2567.456814 | 21281.367225 | 5.494554 | 5.773765 | 18.354673 | 0.0 | 4422.138439 | 21.620013 | ... | 1235.368975 | 0.604972 | 0.533788 | 752.141284 | 654.739515 | 515.042968 | 411.799271 | 296.763047 | 220.841815 | 166.709893 |
| min | 1.000000 | 1.000000 | 1.000000 | 11101.000000 | 1.000000 | 17.712474 | -161.767000 | 2020.0 | 0.000000 | 0.000000 | ... | -27.600000 | 0.000000 | 0.000000 | -13.900000 | -13.900000 | -13.900000 | -13.900000 | -24.200000 | -28.600000 | -36.000000 |
| 25% | 12.000000 | 21.000000 | 8.000000 | 44201.000000 | 1.000000 | 35.256510 | -111.894167 | 2020.0 | 20.000000 | 23.000000 | ... | 0.055000 | 0.600000 | 0.600000 | 0.065000 | 0.062000 | 0.056000 | 0.051000 | 0.040000 | 0.012000 | 0.000000 |
| 50% | 30.000000 | 51.000000 | 34.000000 | 68111.000000 | 1.000000 | 39.521933 | -91.179219 | 2020.0 | 48.000000 | 44.000000 | ... | 0.800000 | 0.900000 | 0.800000 | 1.100000 | 1.016345 | 0.900000 | 0.700000 | 0.450000 | 0.270000 | 0.100000 |
| 75% | 42.000000 | 97.000000 | 1010.000000 | 88131.000000 | 5.000000 | 41.912739 | -81.454926 | 2020.0 | 3058.000000 | 57.000000 | ... | 17.100000 | 1.300000 | 1.100000 | 19.000000 | 16.500000 | 13.400000 | 11.200000 | 8.300000 | 5.700000 | 2.300000 |
| max | 80.000000 | 840.000000 | 9997.000000 | 88503.000000 | 99.000000 | 68.637233 | -64.784868 | 2020.0 | 85075.000000 | 100.000000 | ... | 184250.000000 | 3.800000 | 3.600000 | 120658.000000 | 105458.000000 | 80608.000000 | 61450.000000 | 40467.000000 | 24775.000000 | 10538.000000 |
8 rows × 32 columns
airdf.isnull().sum()
State Code                           0
County Code                          0
Site Num                             0
Parameter Code                       0
POC                                  0
Latitude                             0
Longitude                            0
Datum                                0
Parameter Name                       0
Sample Duration                      0
Pollutant Standard               41091
Metric Used                          0
Method Name                       8759
Year                                 0
Units of Measure                     0
Event Type                           0
Observation Count                    0
Observation Percent                  0
Completeness Indicator               0
Valid Day Count                      0
Required Day Count                   0
Exceptional Data Count               0
Null Data Count                      0
Primary Exceedance Count         45461
Secondary Exceedance Count       46430
Certification Indicator              0
Num Obs Below MDL                    0
Arithmetic Mean                      0
Arithmetic Standard Dev              0
1st Max Value                        0
1st Max DateTime                     0
2nd Max Value                      210
2nd Max DateTime                   210
3rd Max Value                      405
3rd Max DateTime                   405
4th Max Value                      616
4th Max DateTime                   616
1st Max Non Overlapping Value    56149
1st NO Max DateTime              56149
2nd Max Non Overlapping Value    56149
2nd NO Max DateTime              56149
99th Percentile                      0
98th Percentile                      0
95th Percentile                      0
90th Percentile                      0
75th Percentile                      0
50th Percentile                      0
10th Percentile                      0
Local Site Name                   4617
Address                              0
State Name                           0
County Name                          0
City Name                        12788
CBSA Name                         6009
Date of Last Change                  0
dtype: int64
# Drop sparse or low-value columns in a single pass (a second drop of
# 'Local Site Name' would raise a KeyError, since it is already removed here)
airdf = airdf.drop(['Pollutant Standard', 'Primary Exceedance Count', 'Secondary Exceedance Count',
                    '1st Max Non Overlapping Value', '1st NO Max DateTime',
                    '2nd Max Non Overlapping Value', '2nd NO Max DateTime',
                    'Method Name', 'Date of Last Change', 'Local Site Name', 'Address'], axis=1)
airdf.isnull().sum()
State Code                   0
County Code                  0
Site Num                     0
Parameter Code               0
POC                          0
Latitude                     0
Longitude                    0
Datum                        0
Parameter Name               0
Sample Duration              0
Metric Used                  0
Year                         0
Units of Measure             0
Event Type                   0
Observation Count            0
Observation Percent          0
Completeness Indicator       0
Valid Day Count              0
Required Day Count           0
Exceptional Data Count       0
Null Data Count              0
Certification Indicator      0
Num Obs Below MDL            0
Arithmetic Mean              0
Arithmetic Standard Dev      0
1st Max Value                0
1st Max DateTime             0
2nd Max Value              210
2nd Max DateTime           210
3rd Max Value              405
3rd Max DateTime           405
4th Max Value              616
4th Max DateTime           616
99th Percentile              0
98th Percentile              0
95th Percentile              0
90th Percentile              0
75th Percentile              0
50th Percentile              0
10th Percentile              0
State Name                   0
County Name                  0
City Name                12788
CBSA Name                 6009
dtype: int64
airdf = airdf.dropna()
airdf.isnull().sum()
State Code                 0
County Code                0
Site Num                   0
Parameter Code             0
POC                        0
Latitude                   0
Longitude                  0
Datum                      0
Parameter Name             0
Sample Duration            0
Metric Used                0
Year                       0
Units of Measure           0
Event Type                 0
Observation Count          0
Observation Percent        0
Completeness Indicator     0
Valid Day Count            0
Required Day Count         0
Exceptional Data Count     0
Null Data Count            0
Certification Indicator    0
Num Obs Below MDL          0
Arithmetic Mean            0
Arithmetic Standard Dev    0
1st Max Value              0
1st Max DateTime           0
2nd Max Value              0
2nd Max DateTime           0
3rd Max Value              0
3rd Max DateTime           0
4th Max Value              0
4th Max DateTime           0
99th Percentile            0
98th Percentile            0
95th Percentile            0
90th Percentile            0
75th Percentile            0
50th Percentile            0
10th Percentile            0
State Name                 0
County Name                0
City Name                  0
CBSA Name                  0
dtype: int64
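`dropna()` above discards every row with any missing value, which removes a large share of the monitors (City Name alone is missing 12,788 entries). An alternative worth weighing is imputation. A minimal sketch on a hypothetical mini-frame (the column names are illustrative, not the full `airdf`):

```python
import pandas as pd

# Hypothetical mini-frame standing in for airdf
df = pd.DataFrame({
    "2nd Max Value": [0.04, None, 0.05],
    "City Name": ["Fairhope", None, "Yuma"],
})

# Numeric gaps: fill with the column median; categorical gaps: a sentinel label
df["2nd Max Value"] = df["2nd Max Value"].fillna(df["2nd Max Value"].median())
df["City Name"] = df["City Name"].fillna("Not in a city")
```

This keeps rows that are mostly complete instead of dropping them wholesale; whether median filling is appropriate depends on how the EPA codes those gaps.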
airdf = pd.get_dummies(airdf, drop_first = True)
# Keep the target out of the feature matrix: leaving 'Valid Day Count' in X
# leaks the answer to the model and inflates R2 towards 1
X = np.array(airdf.drop(['Valid Day Count'], axis=1)).astype('float32')
y = np.array(airdf['Valid Day Count']).astype('float32')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size = 0.5)
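The two splits above use no `random_state`, so every rerun produces a different train/test/validation partition and the metrics below are not reproducible. A sketch of the same 75/12.5/12.5 split with the seed pinned (the toy arrays here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(48).reshape(24, 2)
y = np.arange(24)

# Pin random_state so the 75/12.5/12.5 split is reproducible run-to-run
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.25, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```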
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
# Train the model
lr_model.fit(X_train, y_train)
LinearRegression()
result = lr_model.score(X_test, y_test)
print("Accuracy : {}".format(result))
Accuracy : 0.9999995843146586
# make predictions on the test data
y_predict = lr_model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 0.056
MSE = 0.0031685263
MAE = 0.02113199
R2 = 0.9999995843146586
Adjusted R2 = 1.0000001539771155
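The RMSE/MSE/MAE/R²/adjusted-R² block is repeated verbatim for every model in this notebook. A small helper keeps it in one place (`evaluate` is a name introduced here, not an sklearn function):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def evaluate(y_true, y_pred, n_features):
    """Return the regression metrics reported throughout this notebook."""
    mse = mean_squared_error(y_true, y_pred)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return {
        "RMSE": np.sqrt(mse),
        "MSE": mse,
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2,
        # Same adjusted-R2 formula used below; note it can exceed 1
        # when n_features approaches the number of test samples
        "Adjusted R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }
```

Usage would be `evaluate(y_test, y_predict, X_test.shape[1])` after each `.predict` call.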
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor()
# Train the model
dt_model.fit(X_train, y_train)
DecisionTreeRegressor()
result = dt_model.score(X_test, y_test)
print("Accuracy : {}".format(result))
Accuracy : 0.9999998675363452
# make predictions on the test data
y_predict = dt_model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 0.032
MSE = 0.0010096930533117932
MAE = 0.0010096930533117932
R2 = 0.9999998675363452
Adjusted R2 = 1.0000000490668528
from sklearn.ensemble import RandomForestRegressor
# Create a model
rf_model = RandomForestRegressor()
# Train the model
rf_model.fit(X_train, y_train)
result = rf_model.score(X_test, y_test)
print("Accuracy : {}".format(result))
# make predictions on the test data
y_predict = rf_model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
import xgboost as xgb
model = xgb.XGBRegressor(objective ='reg:squarederror', learning_rate = 0.1, max_depth = 5, n_estimators = 100)
model.fit(X_train, y_train)
# predict the score of the trained model using the testing dataset
result = model.score(X_test, y_test)
print("Accuracy : {}".format(result))
# make predictions on the test data
y_predict = model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
# Goal: predict the number of ED visits so health infrastructure can be sized accordingly
ervisitsdf = pd.read_excel('ER_Visits_Data.xls')
ervisitsdf
| | COUNTY | YEAR | STRATA | STRATA NAME | AGE GROUP | NUMBER OF ED VISITS | AGE-ADJUSTED ED VISIT RATE |
|---|---|---|---|---|---|---|---|
| 0 | California | 2017 | Total population | All ages | 5 - 17 Years | 191904.0 | 50.4 |
| 1 | Alameda | 2017 | Total population | All ages | 5 - 17 Years | 9939.0 | 64.3 |
| 2 | Alpine | 2017 | Total population | All ages | 5 - 17 Years | 0.0 | 0.0 |
| 3 | Amador | 2017 | Total population | All ages | 5 - 17 Years | 196.0 | 58.4 |
| 4 | Butte | 2017 | Total population | All ages | 5 - 17 Years | 1044.0 | 50.2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2827 | Tulare | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | NaN | NaN |
| 2828 | Tuolumne | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | NaN | NaN |
| 2829 | Ventura | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | 0.0 | 0.0 |
| 2830 | Yolo | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | NaN | NaN |
| 2831 | Yuba | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | NaN | NaN |
2832 rows × 7 columns
ervisitsdf.describe()
| | YEAR | NUMBER OF ED VISITS | AGE-ADJUSTED ED VISIT RATE |
|---|---|---|---|
| count | 2832.000000 | 2462.000000 | 2442.000000 |
| mean | 2018.500000 | 2277.336718 | 58.783825 |
| std | 1.118231 | 10943.917590 | 85.444069 |
| min | 2017.000000 | 0.000000 | 0.000000 |
| 25% | 2017.750000 | 52.000000 | 30.825000 |
| 50% | 2018.500000 | 238.500000 | 46.450000 |
| 75% | 2019.250000 | 1033.750000 | 68.275000 |
| max | 2020.000000 | 191904.000000 | 3531.000000 |
ervisitsdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2832 entries, 0 to 2831
Data columns (total 7 columns):
COUNTY                        2832 non-null object
YEAR                          2832 non-null int64
STRATA                        2832 non-null object
STRATA NAME                   2832 non-null object
AGE GROUP                     2832 non-null object
NUMBER OF ED VISITS           2462 non-null float64
AGE-ADJUSTED ED VISIT RATE    2442 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 155.0+ KB
ervisitsdf['STRATA'].unique()
array(['Total population', 'Child vs. adult', 'Age groups',
'Race/ethnicity'], dtype=object)
ervisitsdf['STRATA NAME'].unique()
array(['All ages', '0ñ17 years', '18+ years', '0ñ4 years', '5ñ17 years',
'18ñ64 years', '65+ years', 'White', 'Black', 'Hispanic',
'Asian/PI', 'AI/AN'], dtype=object)
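The `'ñ'` inside values like `'0ñ17 years'` is an en dash that was mangled by a codec mismatch when the file was exported. Left as-is it will break any string parsing on these labels; a minimal normalization sketch:

```python
import pandas as pd

# 'ñ' in these labels is a mis-decoded en dash; normalize to a plain hyphen
s = pd.Series(['0ñ17 years', '18ñ64 years', '65+ years'])
s = s.str.replace('ñ', '-', regex=False)
```

In the notebook this would be applied as `ervisitsdf['STRATA NAME'] = ervisitsdf['STRATA NAME'].str.replace('ñ', '-', regex=False)`.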
ervisitsdf.hist(bins = 30, figsize = (20,20), color = 'r')
array([[<AxesSubplot:title={'center':'AGE-ADJUSTED ED VISIT RATE'}>,
<AxesSubplot:title={'center':'NUMBER OF ED VISITS'}>],
[<AxesSubplot:title={'center':'YEAR'}>, <AxesSubplot:>]],
dtype=object)
sns.pairplot(ervisitsdf)
<seaborn.axisgrid.PairGrid at 0x7fd43c9ae4f0>
!pip3 install xgboost
Requirement already satisfied: xgboost in /Users/sumit/opt/anaconda3/lib/python3.8/site-packages (1.4.2)
Requirement already satisfied: scipy in /Users/sumit/opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.5.0)
Requirement already satisfied: numpy in /Users/sumit/opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.19.2)
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
Imported AutoViz_Class version: 0.0.81. Call using:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=0,
lowess=False,chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30)
Note: verbose=0 or 1 generates charts and displays them in your local Jupyter notebook.
verbose=2 saves plots in your local machine under AutoViz_Plots directory and does not display charts.
filename = "ER_Visits_Data.xls"
sep = ","
dft = AV.AutoViz(
filename,
sep=",",
depVar="",
dfte=None,
header=0,
verbose=0,
lowess=False,
chart_format="svg",
max_rows_analyzed=150000,
max_cols_analyzed=30,
)
Shape of your Data Set: (2832, 7)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 2
Number of Integer-Categorical Columns = 0
Number of String-Categorical Columns = 3
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 1
Number of NLP String Columns = 0
Number of Date Time Columns = 1
Number of ID Columns = 0
Number of Columns to Delete = 0
7 Predictors classified...
This does not include the Target column(s)
1 variables removed since they were ID or low-information variables
Number of All Scatter Plots = 3
Time to run AutoViz (in seconds) = 4.835
###################### VISUALIZATION Completed ########################
ervisitsdf.isnull().sum()
COUNTY                          0
YEAR                            0
STRATA                          0
STRATA NAME                     0
AGE GROUP                       0
NUMBER OF ED VISITS           370
AGE-ADJUSTED ED VISIT RATE    390
dtype: int64
ervisitsdf = ervisitsdf.fillna(0)
ervisitsdf.isnull().sum()
COUNTY                        0
YEAR                          0
STRATA                        0
STRATA NAME                   0
AGE GROUP                     0
NUMBER OF ED VISITS           0
AGE-ADJUSTED ED VISIT RATE    0
dtype: int64
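`fillna(0)` makes suppressed or unreported counts indistinguishable from genuine zeros (Alpine, for example, really does report 0 visits). A sketch that keeps a missingness flag before imputing, on a hypothetical mini-frame:

```python
import pandas as pd

df = pd.DataFrame({"NUMBER OF ED VISITS": [196.0, None, 0.0]})

# Record which zeros were originally missing before filling them in
df["VISITS_WAS_MISSING"] = df["NUMBER OF ED VISITS"].isna()
df["NUMBER OF ED VISITS"] = df["NUMBER OF ED VISITS"].fillna(0)
```

The flag column lets a model (or a later analysis) distinguish "no visits" from "no data".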
ervisitsdf
| | COUNTY | YEAR | STRATA | STRATA NAME | AGE GROUP | NUMBER OF ED VISITS | AGE-ADJUSTED ED VISIT RATE |
|---|---|---|---|---|---|---|---|
| 0 | California | 2017 | Total population | All ages | 5 - 17 Years | 191904.0 | 50.4 |
| 1 | Alameda | 2017 | Total population | All ages | 5 - 17 Years | 9939.0 | 64.3 |
| 2 | Alpine | 2017 | Total population | All ages | 5 - 17 Years | 0.0 | 0.0 |
| 3 | Amador | 2017 | Total population | All ages | 5 - 17 Years | 196.0 | 58.4 |
| 4 | Butte | 2017 | Total population | All ages | 5 - 17 Years | 1044.0 | 50.2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2827 | Tulare | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | 0.0 | 0.0 |
| 2828 | Tuolumne | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | 0.0 | 0.0 |
| 2829 | Ventura | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | 0.0 | 0.0 |
| 2830 | Yolo | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | 0.0 | 0.0 |
| 2831 | Yuba | 2020 | Race/ethnicity | AI/AN | 41 - 64 Years | 0.0 | 0.0 |
2832 rows × 7 columns
corr_matrix = ervisitsdf.corr()
sns.heatmap(corr_matrix, annot = True)
plt.show()
ervisitsdf['Avg_AGE'] = ervisitsdf['AGE GROUP'].apply(lambda x: (int(x.split('-')[0].strip()) + int(x.split('-')[1].split(' Ye')[0].strip()))/2 if '-' in x else x.split('+')[0])
ervisitsdf['Avg_AGE']
0 11
1 11
2 11
3 11
4 11
...
2827 52.5
2828 52.5
2829 52.5
2830 52.5
2831 52.5
Name: Avg_AGE, Length: 2832, dtype: object
ervisitsdf['MAX_AGE'] = ervisitsdf['AGE GROUP'].apply(lambda x: x.split('-')[1].split(' Ye')[0] if '-' in x else x)
ervisitsdf['MAX_AGE']
0 17
1 17
2 17
3 17
4 17
...
2827 64
2828 64
2829 64
2830 64
2831 64
Name: MAX_AGE, Length: 2832, dtype: object
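The split-based lambdas above return strings for open-ended groups like `'65+'` and mixed types elsewhere (note both columns come out as `dtype: object`). A regex-based sketch that yields numeric bounds and midpoints instead (the variable names are illustrative):

```python
import pandas as pd

ages = pd.Series(['5 - 17 Years', '41 - 64 Years', '65+ Years'])

# Pull numeric bounds; open-ended groups like '65+' only have a lower bound
bounds = ages.str.extract(r'(\d+)\s*-\s*(\d+)').astype(float)
low = bounds[0].fillna(ages.str.extract(r'(\d+)\+')[0].astype(float))
high = bounds[1].fillna(low)          # treat '65+' as 65 for a max-age proxy
avg = (low + high) / 2
```

This gives real `float64` columns, so they can be fed to a model directly instead of being one-hot encoded as string categories.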
ervisitsdf.drop(['AGE GROUP'], axis=1, inplace=True)
ervisitsdf
| | YEAR | NUMBER OF ED VISITS | AGE-ADJUSTED ED VISIT RATE | COUNTY_Alpine | COUNTY_Amador | COUNTY_Butte | COUNTY_Calaveras | COUNTY_California | COUNTY_Colusa | COUNTY_Contra Costa | ... | STRATA NAME_AI/AN | STRATA NAME_All ages | STRATA NAME_Asian/PI | STRATA NAME_Black | STRATA NAME_Hispanic | STRATA NAME_White | Avg_AGE_11.0 | Avg_AGE_29.0 | Avg_AGE_52.5 | Avg_AGE_65 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 191904.0 | 50.4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2017 | 9939.0 | 64.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2017 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2017 | 196.0 | 58.4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 2017 | 1044.0 | 50.2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2827 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2828 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2829 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2830 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2831 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2832 rows × 79 columns
ervisitsdf = pd.get_dummies(ervisitsdf, drop_first = True)
finaldf = ervisitsdf.drop(['NUMBER OF ED VISITS'], axis=1)
X = np.array(finaldf).astype('float32')
y = np.array(ervisitsdf['NUMBER OF ED VISITS']).astype('float32')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size = 0.5)
from sklearn.linear_model import LinearRegression
lr_model = LinearRegression()
# Train the model
lr_model.fit(X_train, y_train)
LinearRegression()
result = lr_model.score(X_test, y_test)
print("Accuracy : {}".format(result))
Accuracy : 0.6068326538349964
# make predictions on the test data
y_predict = lr_model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 7729.09
MSE = 59738840.0
MAE = 1723.2789
R2 = 0.6068326538349964
Adjusted R2 = 0.4953160974681954
from sklearn.tree import DecisionTreeRegressor
dt_model = DecisionTreeRegressor()
# Train the model
dt_model.fit(X_train, y_train)
DecisionTreeRegressor()
result = dt_model.score(X_test, y_test)
print("Accuracy : {}".format(result))
Accuracy : 0.9918896651143543
# make predictions on the test data
y_predict = dt_model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 1110.092
MSE = 1232304.836158192
MAE = 298.8926553672316
R2 = 0.9918896651143543
Adjusted R2 = 0.9895892792195166
from sklearn.ensemble import RandomForestRegressor
# Create a model
rf_model = RandomForestRegressor()
# Train the model
rf_model.fit(X_train, y_train)
RandomForestRegressor()
result = rf_model.score(X_test, y_test)
print("Accuracy : {}".format(result))
Accuracy : 0.9821578705794972
# make predictions on the test data
y_predict = rf_model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 1646.505
MSE = 2710978.360605086
MAE = 394.5754237288136
R2 = 0.9821578705794972
Adjusted R2 = 0.9770971938711365
import xgboost as xgb
model = xgb.XGBRegressor(objective ='reg:squarederror', learning_rate = 0.1, max_depth = 5, n_estimators = 100)
model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.1, max_delta_step=0, max_depth=5,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
ervisitsdf
| | YEAR | NUMBER OF ED VISITS | AGE-ADJUSTED ED VISIT RATE | COUNTY_Alpine | COUNTY_Amador | COUNTY_Butte | COUNTY_Calaveras | COUNTY_California | COUNTY_Colusa | COUNTY_Contra Costa | ... | STRATA NAME_AI/AN | STRATA NAME_All ages | STRATA NAME_Asian/PI | STRATA NAME_Black | STRATA NAME_Hispanic | STRATA NAME_White | Avg_AGE_11.0 | Avg_AGE_29.0 | Avg_AGE_52.5 | Avg_AGE_65 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 191904.0 | 50.4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2017 | 9939.0 | 64.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 2017 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2017 | 196.0 | 58.4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 2017 | 1044.0 | 50.2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2827 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2828 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2829 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2830 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2831 | 2020 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2832 rows × 79 columns
# predict the score of the trained model using the testing dataset
result = model.score(X_test, y_test)
print("Accuracy : {}".format(result))
Accuracy : 0.9898078808055406
# make predictions on the test data
y_predict = model.predict(X_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt
k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)
print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 1244.434
MSE = 1548616.4
MAE = 481.01086
R2 = 0.9898078808055406
Adjusted R2 = 0.9869170251794758
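With four models scored on the same held-out set, a small summary frame makes the comparison explicit. The values below are rounded from the outputs printed above (and `results`/`best` are names introduced here):

```python
import pandas as pd

# R2 and MAE rounded from the test-set results reported in this notebook
results = pd.DataFrame(
    {"R2": [0.607, 0.992, 0.982, 0.990], "MAE": [1723.3, 298.9, 394.6, 481.0]},
    index=["Linear", "DecisionTree", "RandomForest", "XGBoost"],
)
best = results["R2"].idxmax()
```

Because the splits were not seeded, these rankings could shuffle between runs; cross-validation would give a steadier comparison.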
# R2 on the training set, to compare against the test-set R2 above
y_predict1 = model.predict(X_train)
r2 = r2_score(y_train, y_predict1)
r2
0.995024491323049
filename = "annual_conc_by_monitor_2020.csv"
sep = ","
dft = AV.AutoViz(
filename,
sep=",",
depVar="",
dfte=None,
header=0,
verbose=0,
lowess=False,
chart_format="svg",
max_rows_analyzed=1500000,
max_cols_analyzed=60,
)
Shape of your Data Set: (56414, 55)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 17
Number of Integer-Categorical Columns = 11
Number of String-Categorical Columns = 4
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 2
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 13
Number of NLP String Columns = 1
Number of Date Time Columns = 0
Number of ID Columns = 0
Number of Columns to Delete = 7
55 Predictors classified...
This does not include the Target column(s)
21 variables removed since they were ID or low-information variables
Number of All Scatter Plots = 153
Time to run AutoViz (in seconds) = 93.869
###################### VISUALIZATION Completed ########################
filename = "hourly_44201_2021.csv"
sep = ","
dft = AV.AutoViz(
    filename,
    sep=sep,
    depVar="",
    dfte=None,
    header=0,
    verbose=0,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=1500000,
    max_cols_analyzed=60,
)
Shape of your Data Set: (878586, 24)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 3
Number of Integer-Categorical Columns = 5
Number of String-Categorical Columns = 0
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 1
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 7
Number of NLP String Columns = 1
Number of Date Time Columns = 0
Number of ID Columns = 0
Number of Columns to Delete = 7
24 Predictors classified...
This does not include the Target column(s)
15 variables removed since they were ID or low-information variables
Number of All Scatter Plots = 6
Time to run AutoViz (in seconds) = 21.478
###################### VISUALIZATION Completed ########################
filename = "daily_aqi_by_county_2021.csv"
sep = ","
dft = AV.AutoViz(
    filename,
    sep=sep,
    depVar="",
    dfte=None,
    header=0,
    verbose=0,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=1500000,
    max_cols_analyzed=60,
)
Shape of your Data Set: (40518, 10)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
Number of Numeric Columns = 0
Number of Integer-Categorical Columns = 4
Number of String-Categorical Columns = 2
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 4
Number of NLP String Columns = 0
Number of Date Time Columns = 0
Number of ID Columns = 0
Number of Columns to Delete = 0
10 Predictors classified...
This does not include the Target column(s)
4 variables removed since they were ID or low-information variables
Number of All Scatter Plots = 10
Time to run AutoViz (in seconds) = 5.089
###################### VISUALIZATION Completed ########################
pd.read_csv('hourly_44201_2021.csv')
/Users/sumit/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3146: DtypeWarning: Columns (17) have mixed types. Specify dtype option on import or set low_memory=False.
| State Code | County Code | Site Num | Parameter Code | POC | Latitude | Longitude | Datum | Parameter Name | Date Local | ... | Units of Measure | MDL | Uncertainty | Qualifier | Method Type | Method Code | Method Name | State Name | County Name | Date of Last Change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49 | 9991 | 44201 | 1 | 34.289001 | -85.970065 | NAD83 | Ozone | 2021-01-01 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | DeKalb | 2021-04-08 |
| 1 | 1 | 49 | 9991 | 44201 | 1 | 34.289001 | -85.970065 | NAD83 | Ozone | 2021-01-01 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | DeKalb | 2021-04-08 |
| 2 | 1 | 49 | 9991 | 44201 | 1 | 34.289001 | -85.970065 | NAD83 | Ozone | 2021-01-01 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | DeKalb | 2021-04-08 |
| 3 | 1 | 49 | 9991 | 44201 | 1 | 34.289001 | -85.970065 | NAD83 | Ozone | 2021-01-01 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | DeKalb | 2021-04-08 |
| 4 | 1 | 49 | 9991 | 44201 | 1 | 34.289001 | -85.970065 | NAD83 | Ozone | 2021-01-01 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Alabama | DeKalb | 2021-04-08 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 878581 | 56 | 39 | 1011 | 44201 | 1 | 44.565360 | -110.400340 | WGS84 | Ozone | 2021-02-28 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Wyoming | Teton | 2021-04-07 |
| 878582 | 56 | 39 | 1011 | 44201 | 1 | 44.565360 | -110.400340 | WGS84 | Ozone | 2021-02-28 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Wyoming | Teton | 2021-04-07 |
| 878583 | 56 | 39 | 1011 | 44201 | 1 | 44.565360 | -110.400340 | WGS84 | Ozone | 2021-02-28 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Wyoming | Teton | 2021-04-07 |
| 878584 | 56 | 39 | 1011 | 44201 | 1 | 44.565360 | -110.400340 | WGS84 | Ozone | 2021-02-28 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Wyoming | Teton | 2021-04-07 |
| 878585 | 56 | 39 | 1011 | 44201 | 1 | 44.565360 | -110.400340 | WGS84 | Ozone | 2021-02-28 | ... | Parts per million | 0.005 | NaN | NaN | FEM | 47 | INSTRUMENTAL - ULTRA VIOLET | Wyoming | Teton | 2021-04-07 |
878586 rows × 24 columns
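The DtypeWarning above means one column mixes numbers and strings, so pandas' chunked type inference disagrees with itself between chunks. Declaring that column's dtype up front (or passing `low_memory=False`) avoids the guess. A minimal self-contained reproduction on synthetic data; `Qualifier` is used here as a stand-in for the offending column:

```python
import io
import pandas as pd

# A column mixing blanks, letters, and digits would normally trigger
# pandas' mixed-type inference; forcing it to str keeps it consistent.
csv = io.StringIO("Sample Measurement,Qualifier\n0.031,\n0.042,V\n0.05,7\n")
df = pd.read_csv(csv, dtype={'Qualifier': str})
print(df['Qualifier'].dtype)  # → object: every non-missing value is a string
```

On the real file the same idea is `pd.read_csv('hourly_44201_2021.csv', dtype={...}, low_memory=False)`.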
dailyaqidf = pd.read_csv('daily_aqi_by_county_2021.csv')
dailyaqidf
| State Name | county Name | State Code | County Code | Date | AQI | Category | Defining Parameter | Defining Site | Number of Sites Reporting | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | DeKalb | 1 | 49 | 2021-01-01 | 30 | Good | Ozone | 01-049-9991 | 1 |
| 1 | Alabama | DeKalb | 1 | 49 | 2021-01-02 | 27 | Good | Ozone | 01-049-9991 | 1 |
| 2 | Alabama | DeKalb | 1 | 49 | 2021-01-03 | 34 | Good | Ozone | 01-049-9991 | 1 |
| 3 | Alabama | DeKalb | 1 | 49 | 2021-01-04 | 36 | Good | Ozone | 01-049-9991 | 1 |
| 4 | Alabama | DeKalb | 1 | 49 | 2021-01-05 | 31 | Good | Ozone | 01-049-9991 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 40513 | Wyoming | Teton | 56 | 39 | 2021-02-25 | 43 | Good | Ozone | 56-039-0008 | 3 |
| 40514 | Wyoming | Teton | 56 | 39 | 2021-02-26 | 42 | Good | Ozone | 56-039-0008 | 3 |
| 40515 | Wyoming | Teton | 56 | 39 | 2021-02-27 | 41 | Good | Ozone | 56-039-0008 | 3 |
| 40516 | Wyoming | Teton | 56 | 39 | 2021-02-28 | 42 | Good | Ozone | 56-039-0008 | 3 |
| 40517 | Wyoming | Teton | 56 | 39 | 2021-03-01 | 1 | Good | CO | 56-039-1013 | 1 |
40518 rows × 10 columns
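A typical next step with `dailyaqidf` is to parse the Date column and aggregate AQI over time. A sketch on synthetic rows with the same column names (on the full file, replace `df` with `dailyaqidf`):

```python
import pandas as pd

# Parse dates and compute the mean AQI per state per calendar month.
df = pd.DataFrame({
    'State Name': ['Alabama', 'Alabama', 'Wyoming'],
    'Date': ['2021-01-01', '2021-01-02', '2021-02-25'],
    'AQI': [30, 27, 43],
})
df['Date'] = pd.to_datetime(df['Date'])
monthly = df.groupby(['State Name', df['Date'].dt.to_period('M')])['AQI'].mean()
print(monthly.tolist())  # → [28.5, 43.0]
```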
dailyaqidf['County Code'].unique()
array([ 49, 73, 68, 1, 3, 5, 7, 12, 13, 15, 17, 19, 21,
23, 25, 27, 35, 51, 67, 101, 113, 119, 139, 143, 11, 29,
31, 43, 47, 53, 57, 61, 63, 65, 69, 71, 83, 87, 89,
95, 99, 103, 107, 111, 41, 45, 59, 77, 97, 123, 2, 115,
121, 127, 135, 151, 153, 185, 215, 231, 245, 247, 261, 303, 85,
117, 133, 157, 161, 163, 167, 179, 197, 201, 37, 39, 55, 91,
105, 129, 145, 147, 141, 169, 173, 183, 33, 137, 155, 177, 193,
175, 221, 229, 93, 9, 510, 81, 165, 75, 189, 79, 109, 131,
159, 199, 125, 227, 233, 251, 257, 309, 349, 361, 367, 373, 375,
381, 395, 397, 439, 453, 650, 670, 710])
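Note that these county codes repeat across states (49 above is DeKalb, AL, but 49 also exists in other states), so a county is only uniquely identified by the (State Code, County Code) pair. The usual convention is the 5-digit FIPS string: 2-digit zero-padded state code plus 3-digit zero-padded county code. A sketch on two synthetic rows:

```python
import pandas as pd

# Build a 5-digit FIPS identifier from the state and county codes.
df = pd.DataFrame({'State Code': [1, 56], 'County Code': [49, 39]})
df['FIPS'] = (df['State Code'].astype(str).str.zfill(2)
              + df['County Code'].astype(str).str.zfill(3))
print(df['FIPS'].tolist())  # → ['01049', '56039']
```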
pd.read_csv('annual_aqi_by_cbsa_2021.csv')
| CBSA | CBSA Code | Year | Days with AQI | Good Days | Moderate Days | Unhealthy for Sensitive Groups Days | Unhealthy Days | Very Unhealthy Days | Hazardous Days | Max AQI | 90th Percentile AQI | Median AQI | Days CO | Days NO2 | Days Ozone | Days SO2 | Days PM2.5 | Days PM10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aberdeen, WA | 10140 | 2021 | 31 | 31 | 0 | 0 | 0 | 0 | 0 | 49 | 38 | 23 | 0 | 0 | 0 | 0 | 31 | 0 |
| 1 | Adrian, MI | 10300 | 2021 | 90 | 67 | 23 | 0 | 0 | 0 | 0 | 83 | 61 | 40 | 0 | 0 | 27 | 0 | 63 | 0 |
| 2 | Akron, OH | 10420 | 2021 | 59 | 47 | 12 | 0 | 0 | 0 | 0 | 74 | 62 | 33 | 0 | 0 | 0 | 6 | 53 | 0 |
| 3 | Albany, GA | 10500 | 2021 | 59 | 46 | 13 | 0 | 0 | 0 | 0 | 92 | 70 | 35 | 0 | 0 | 0 | 0 | 59 | 0 |
| 4 | Albany-Schenectady-Troy, NY | 10580 | 2021 | 90 | 74 | 16 | 0 | 0 | 0 | 0 | 86 | 56 | 39 | 0 | 0 | 54 | 0 | 36 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 315 | Worcester, MA-CT | 49340 | 2021 | 62 | 57 | 5 | 0 | 0 | 0 | 0 | 87 | 49 | 37 | 0 | 0 | 40 | 0 | 19 | 3 |
| 316 | Yakima, WA | 49420 | 2021 | 31 | 10 | 21 | 0 | 0 | 0 | 0 | 87 | 72 | 62 | 0 | 0 | 0 | 0 | 31 | 0 |
| 317 | York-Hanover, PA | 49620 | 2021 | 81 | 80 | 1 | 0 | 0 | 0 | 0 | 64 | 42 | 35 | 0 | 0 | 81 | 0 | 0 | 0 |
| 318 | Youngstown-Warren-Boardman, OH-PA | 49660 | 2021 | 78 | 66 | 12 | 0 | 0 | 0 | 0 | 74 | 59 | 38 | 0 | 0 | 43 | 0 | 35 | 0 |
| 319 | Yuma, AZ | 49740 | 2021 | 90 | 74 | 16 | 0 | 0 | 0 | 0 | 94 | 58 | 44 | 0 | 0 | 61 | 0 | 8 | 21 |
320 rows × 19 columns
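One simple derived metric from this annual summary is the share of monitored days rated Good, which normalizes for the fact that CBSAs report different numbers of days. Sketched on two synthetic rows mirroring the table's columns:

```python
import pandas as pd

# Fraction of days with AQI that fell in the 'Good' category.
df = pd.DataFrame({
    'CBSA': ['Aberdeen, WA', 'Yakima, WA'],
    'Days with AQI': [31, 31],
    'Good Days': [31, 10],
})
df['Good Share'] = (df['Good Days'] / df['Days with AQI']).round(3)
print(df['Good Share'].tolist())  # → [1.0, 0.323]
```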
pd.read_csv('annual_aqi_by_county_2021.csv')
| State | County | Year | Days with AQI | Good Days | Moderate Days | Unhealthy for Sensitive Groups Days | Unhealthy Days | Very Unhealthy Days | Hazardous Days | Max AQI | 90th Percentile AQI | Median AQI | Days CO | Days NO2 | Days Ozone | Days SO2 | Days PM2.5 | Days PM10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | DeKalb | 2021 | 58 | 58 | 0 | 0 | 0 | 0 | 0 | 49 | 42 | 32 | 0 | 0 | 58 | 0 | 0 | 0 |
| 1 | Alabama | Jefferson | 2021 | 60 | 33 | 27 | 0 | 0 | 0 | 0 | 93 | 64 | 50 | 1 | 1 | 2 | 0 | 55 | 1 |
| 2 | Alaska | Denali | 2021 | 59 | 59 | 0 | 0 | 0 | 0 | 0 | 43 | 41 | 39 | 0 | 0 | 59 | 0 | 0 | 0 |
| 3 | Arizona | Apache | 2021 | 87 | 87 | 0 | 0 | 0 | 0 | 0 | 25 | 19 | 11 | 0 | 0 | 0 | 0 | 1 | 86 |
| 4 | Arizona | Cochise | 2021 | 90 | 77 | 12 | 1 | 0 | 0 | 0 | 104 | 54 | 41 | 0 | 0 | 48 | 0 | 0 | 42 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 598 | Wyoming | Albany | 2021 | 59 | 59 | 0 | 0 | 0 | 0 | 0 | 47 | 45 | 43 | 0 | 0 | 59 | 0 | 0 | 0 |
| 599 | Wyoming | Campbell | 2021 | 89 | 89 | 0 | 0 | 0 | 0 | 0 | 44 | 41 | 34 | 0 | 1 | 88 | 0 | 0 | 0 |
| 600 | Wyoming | Sublette | 2021 | 90 | 89 | 1 | 0 | 0 | 0 | 0 | 77 | 47 | 43 | 0 | 0 | 90 | 0 | 0 | 0 |
| 601 | Wyoming | Sweetwater | 2021 | 90 | 83 | 7 | 0 | 0 | 0 | 0 | 83 | 40 | 10 | 0 | 0 | 0 | 0 | 0 | 90 |
| 602 | Wyoming | Teton | 2021 | 60 | 60 | 0 | 0 | 0 | 0 | 0 | 46 | 43 | 38 | 1 | 0 | 59 | 0 | 0 | 0 |
603 rows × 19 columns
pd.read_csv('daily_42101_2021.csv')
| State Code | County Code | Site Num | Parameter Code | POC | Latitude | Longitude | Datum | Parameter Name | Sample Duration | ... | AQI | Method Code | Method Name | Local Site Name | Address | State Name | County Name | City Name | CBSA Name | Date of Last Change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 73 | 23 | 42101 | 2 | 33.553056 | -86.815000 | WGS84 | Carbon monoxide | 1 HOUR | ... | NaN | 93.0 | INSTRUMENTAL - GAS FILTER CORRELATION CO ANALYZER | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 1 | 1 | 73 | 23 | 42101 | 2 | 33.553056 | -86.815000 | WGS84 | Carbon monoxide | 1 HOUR | ... | NaN | 93.0 | INSTRUMENTAL - GAS FILTER CORRELATION CO ANALYZER | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 2 | 1 | 73 | 23 | 42101 | 2 | 33.553056 | -86.815000 | WGS84 | Carbon monoxide | 1 HOUR | ... | NaN | 93.0 | INSTRUMENTAL - GAS FILTER CORRELATION CO ANALYZER | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 3 | 1 | 73 | 23 | 42101 | 2 | 33.553056 | -86.815000 | WGS84 | Carbon monoxide | 1 HOUR | ... | NaN | 93.0 | INSTRUMENTAL - GAS FILTER CORRELATION CO ANALYZER | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 4 | 1 | 73 | 23 | 42101 | 2 | 33.553056 | -86.815000 | WGS84 | Carbon monoxide | 1 HOUR | ... | NaN | 93.0 | INSTRUMENTAL - GAS FILTER CORRELATION CO ANALYZER | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14880 | 56 | 39 | 1013 | 42101 | 1 | 44.373056 | -110.830833 | WGS84 | Carbon monoxide | 8-HR RUN AVG END HOUR | ... | 1.0 | NaN | - | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 14881 | 56 | 39 | 1013 | 42101 | 1 | 44.373056 | -110.830833 | WGS84 | Carbon monoxide | 8-HR RUN AVG END HOUR | ... | 1.0 | NaN | - | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 14882 | 56 | 39 | 1013 | 42101 | 1 | 44.373056 | -110.830833 | WGS84 | Carbon monoxide | 8-HR RUN AVG END HOUR | ... | 1.0 | NaN | - | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 14883 | 56 | 39 | 1013 | 42101 | 1 | 44.373056 | -110.830833 | WGS84 | Carbon monoxide | 8-HR RUN AVG END HOUR | ... | 1.0 | NaN | - | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 14884 | 56 | 39 | 1013 | 42101 | 1 | 44.373056 | -110.830833 | WGS84 | Carbon monoxide | 8-HR RUN AVG END HOUR | ... | 1.0 | NaN | - | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
14885 rows × 29 columns
pd.read_csv('daily_42602_2021.csv')
| State Code | County Code | Site Num | Parameter Code | POC | Latitude | Longitude | Datum | Parameter Name | Sample Duration | ... | AQI | Method Code | Method Name | Local Site Name | Address | State Name | County Name | City Name | CBSA Name | Date of Last Change | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 73 | 23 | 42602 | 1 | 33.553056 | -86.815000 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 22 | 200 | Teledyne-API Model 200EUP or T200UP - Photolyt... | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 1 | 1 | 73 | 23 | 42602 | 1 | 33.553056 | -86.815000 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 8 | 200 | Teledyne-API Model 200EUP or T200UP - Photolyt... | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 2 | 1 | 73 | 23 | 42602 | 1 | 33.553056 | -86.815000 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 27 | 200 | Teledyne-API Model 200EUP or T200UP - Photolyt... | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 3 | 1 | 73 | 23 | 42602 | 1 | 33.553056 | -86.815000 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 25 | 200 | Teledyne-API Model 200EUP or T200UP - Photolyt... | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| 4 | 1 | 73 | 23 | 42602 | 1 | 33.553056 | -86.815000 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 29 | 200 | Teledyne-API Model 200EUP or T200UP - Photolyt... | North Birmingham | NO. B'HAM,SOU R.R., 3009 28TH ST. NO. | Alabama | Jefferson | Birmingham | Birmingham-Hoover, AL | 2021-04-20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12291 | 56 | 39 | 1013 | 42602 | 1 | 44.373056 | -110.830833 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 3 | 74 | INSTRUMENTAL - CHEMILUMINESCENCE | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 12292 | 56 | 39 | 1013 | 42602 | 1 | 44.373056 | -110.830833 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 3 | 74 | INSTRUMENTAL - CHEMILUMINESCENCE | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 12293 | 56 | 39 | 1013 | 42602 | 1 | 44.373056 | -110.830833 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 2 | 74 | INSTRUMENTAL - CHEMILUMINESCENCE | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 12294 | 56 | 39 | 1013 | 42602 | 1 | 44.373056 | -110.830833 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 4 | 74 | INSTRUMENTAL - CHEMILUMINESCENCE | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
| 12295 | 56 | 39 | 1013 | 42602 | 1 | 44.373056 | -110.830833 | WGS84 | Nitrogen dioxide (NO2) | 1 HOUR | ... | 3 | 74 | INSTRUMENTAL - CHEMILUMINESCENCE | Yellowstone National Park - Old Faithful Snow ... | Yellowstone National Park - Old Faithful Snow ... | Wyoming | Teton | Not in a city | Jackson, WY-ID | 2021-04-07 |
12296 rows × 29 columns